fix(exp): Add a real agent example #3507
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough
Adds a travel-agent experiment module that runs travel queries, captures OpenTelemetry spans to extract trajectory prompts/completions and tool calls, provides utilities to extract trajectories and run tasks with an in-memory span capture, and updates an evaluator API to accept threshold and conditions parameters. A minimal span-extraction sketch follows the sequence diagram below.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Orchestrator as run_travel_agent_experiment()
    participant Task as travel_agent_task(row)
    participant Agent as travel_agent_example.run_travel_query
    participant Tracer as InMemory OTEL Exporter
    participant Extract as extract_trajectory_from_spans()
    participant Evaluator as EvaluatorMadeByTraceloop
    Orchestrator->>Orchestrator: configure evaluators (threshold/conditions)
    Orchestrator->>Orchestrator: load travel-queries dataset
    loop per query row
        Orchestrator->>Task: invoke travel_agent_task(row)
        Task->>Tracer: reset/init in-memory exporter
        Task->>Agent: run_travel_query(row.query)
        Agent->>Tracer: emit spans (gen_ai.prompt.*, gen_ai.completion.*, gen_ai.tool.*)
        Tracer-->>Task: finished spans captured
        Task->>Extract: extract_trajectory_from_spans(spans)
        Extract-->>Task: return trajectories (prompts, completions, tool calls)
        Task-->>Orchestrator: return result with trajectory JSON and final_completion
    end
    Orchestrator->>Evaluator: run evaluators on collected results
    Evaluator-->>Orchestrator: evaluation summaries
    Orchestrator->>Orchestrator: print results summary
```
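To ground the walkthrough, here is a minimal sketch of the span-to-trajectory extraction step described above. This is illustrative rather than the PR's exact code: it assumes each captured span exposes a flat `attributes` mapping with `gen_ai.*` semantic-convention keys, and the real module's tool-call handling may differ.

```python
# Illustrative sketch: collect trajectory data from finished spans captured
# by an in-memory exporter. Assumes gen_ai.* attribute keys as in the diagram.
def extract_trajectory(spans):
    trajectory_prompts = {}
    trajectory_completions = {}
    tool_calls = []
    for span in spans:
        attributes = span.attributes or {}
        for key, value in attributes.items():
            if key.startswith("gen_ai.prompt."):
                trajectory_prompts[key] = value
            elif key.startswith("gen_ai.completion."):
                trajectory_completions[key] = value
            elif key.startswith("gen_ai.tool."):  # tool-call handling is assumed here
                tool_calls.append(value)
    return {
        "trajectory_prompts": trajectory_prompts,
        "trajectory_completions": trajectory_completions,
        "tool_calls": tool_calls,
    }
```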
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
…nk/travel_agent_case
Caution
Changes requested ❌
Reviewed everything up to f19d82c in 2 minutes and 57 seconds. Click for details.
- Reviewed 355 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:33
- Draft comment: Verify that 'spec' and 'spec.loader' are not None before using them to load the module. This prevents silent failures if the file is missing or the spec fails to load.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). The comment identifies a legitimate potential issue - if the file is missing or spec fails to load, the code will raise an AttributeError when trying to access spec.loader or call module_from_spec(None). However, the comment violates the rules in two ways: 1) It starts with "Verify that..." which is explicitly called out as a pattern to avoid, and 2) It's somewhat speculative about "silent failures" - if spec is None, this would actually raise an exception, not fail silently. The comment is asking the author to check/verify rather than stating a clear code change. This is a new file being added, so this is about changes made in the diff. While there is a real potential issue here (missing null checks), the phrasing "Verify that..." makes this an ask for the author to check rather than a clear actionable suggestion. Also, the failure wouldn't actually be "silent" - it would raise an AttributeError. The comment could be valid if rephrased more directly as "Add null checks for spec and spec.loader". The comment does identify a real code quality issue where proper error handling is missing. However, the phrasing violates the stated rules about not asking authors to "verify" or "ensure" things. The rules explicitly state "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended" and give "Verify that..." as an example of what not to do. This comment should be deleted because it uses the "Verify that..." pattern which is explicitly prohibited in the rules. While it identifies a legitimate issue, the phrasing asks the author to check/verify rather than stating a clear code change requirement.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:293
- Draft comment: Returned result keys ('trajectory_prompts'/'trajectory_completions') differ from what the sample results block expects ('prompts'/'completions'). Ensure consistent naming or proper field mapping.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 35% vs. threshold = 50%). The comment identifies a legitimate inconsistency between what the function returns and what the results printing code expects. However, I need to consider: 1) This is in a print/logging section that displays sample results, not core functionality. 2) The .get() method with a default value means this won't crash - it will just show incorrect/empty values. 3) The comment says "Ensure consistent naming or proper field mapping" which is somewhat vague and doesn't provide a clear action. 4) According to the rules, I should not keep comments that ask the author to "ensure" things. 5) The actual evaluators might work fine with 'trajectory_prompts'/'trajectory_completions' - the issue is only in the display code. 6) This is a new file being added, so this is part of the changes. The comment does identify a real bug where the results printing will show incorrect values. This isn't just informational - it's pointing out that lines 299-300 will not work as intended because they're looking for keys that don't exist in the returned dict. The comment starts with "Ensure" which typically violates the rules, but it's also identifying a concrete mismatch that would lead to incorrect output. While the comment uses "Ensure" language which is typically discouraged, it does identify a concrete bug where the print statements will fail to display the intended information. The .get() calls will return empty strings and the split operations will produce incorrect counts. This is actionable - the author should either change the return keys or the print statement keys to match. It's not speculative or asking for confirmation. This comment identifies a real inconsistency that will cause the results display to show incorrect information. While it uses "Ensure" language, it points to a concrete bug in the code. However, the phrasing could be more direct about what needs to be fixed. Given the rules emphasize removing comments unless there's STRONG EVIDENCE they're correct, and this is a relatively minor display issue (not core functionality), I'll lean toward removing it.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:136
- Draft comment: Rather than directly deleting the 'TracerWrapper.instance', consider implementing a proper reset API for the tracer to avoid potential side effects.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). This is a code quality suggestion about using a proper API instead of directly manipulating internal state. However, there are several issues: 1) This is a new file being added, so the author clearly wrote this intentionally as a workaround, 2) The comment doesn't provide evidence that a reset API exists in TracerWrapper, 3) Without seeing the TracerWrapper implementation, we can't know if this approach is problematic or if there's a better alternative, 4) The comment is speculative ("potential side effects") without demonstrating actual issues, 5) The local variable assignment on line 137 suggests the author may have had a reason for this approach (possibly to keep a reference). The rules state we should only keep comments with STRONG EVIDENCE they are correct, and speculative comments should be removed. The comment might be valid if TracerWrapper does have a proper reset API that should be used instead. The author might not be aware of it. Also, directly deleting singleton instances can indeed cause issues in some cases. However, without evidence that such an API exists or that this approach causes problems, this is speculative. While the concern about proper APIs is generally valid, this comment lacks strong evidence. It doesn't show that a reset API exists, doesn't demonstrate actual side effects, and is speculative ("potential side effects"). The code appears intentional given the comment on line 135 explaining the purpose. Without access to TracerWrapper's implementation or evidence of problems, this is a speculative suggestion that should be removed per the rules. This comment should be deleted. It's a speculative suggestion without strong evidence of an actual problem or proof that a better API exists. The rules require strong evidence for keeping comments, and this doesn't meet that bar.
Workflow ID: wflow_kMz1aMaasqEpWUhZ
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 4
🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
31-38: Guard dynamic travel agent loading for clearer errors
`_load_travel_agent_module` assumes the `agents_path` exists and that `spec`/`spec.loader` is non-None. If the file is missing or the path changes, this will fail with a somewhat opaque `AttributeError`. You could make failures clearer:

```diff
 def _load_travel_agent_module():
     """Dynamically load the travel_agent_example module."""
     agents_path = Path(__file__).parent.parent.parent / "agents" / "travel_agent_example.py"
-    spec = importlib.util.spec_from_file_location("travel_agent_example", agents_path)
-    module = importlib.util.module_from_spec(spec)
-    sys.modules["travel_agent_example"] = module
-    spec.loader.exec_module(module)
-    return module
+    if not agents_path.exists():
+        raise FileNotFoundError(f"travel_agent_example.py not found at {agents_path}")
+
+    spec = importlib.util.spec_from_file_location("travel_agent_example", agents_path)
+    if spec is None or spec.loader is None:
+        raise ImportError(f"Could not load spec for travel_agent_example from {agents_path}")
+
+    module = importlib.util.module_from_spec(spec)
+    sys.modules["travel_agent_example"] = module
+    spec.loader.exec_module(module)
+    return module
```

This keeps the example robust while failing fast with a helpful message if the file isn't where it's expected.
185-219: Normalize trajectory JSON handling and include `tool_calls` in the task result

The current trajectory handling has a few rough edges:
- For empty trajectories, you set `trajectory_prompts = json.dumps([])` / `trajectory_completions = json.dumps([])` and then later call `json.dumps` again, which yields `"[]"` (a JSON string of a JSON string) instead of `[]`.
- The "Prompt attributes captured" / "Completion attributes captured" counts are based on mixed dict/string types, so counts become misleading when empty.
- The returned task result doesn't expose `tool_calls`, even though they're collected and later referenced in the sample results.

You can address all three by:

```diff
-    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-    # If empty, use JSON string fallback to avoid validation errors
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
-
-    print("📊 Trajectory Summary:")
-    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
+    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    prompt_attr_count = len(trajectory_prompts)
+    completion_attr_count = len(trajectory_completions)
+
+    # Convert to JSON strings for evaluator input; use [] when empty
+    json_trajectory_prompts = json.dumps(trajectory_prompts or [])
+    json_trajectory_completions = json.dumps(trajectory_completions or [])
+
+    print("📊 Trajectory Summary:")
+    print(f"   - Prompt attributes captured: {prompt_attr_count}")
+    print(f"   - Completion attributes captured: {completion_attr_count}")
@@
-    # - "context" maps to "reference"
-    # - "trajectory_prompts" and "trajectory_completions" as dicts with llm.prompts/completions.* keys
-
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
-    # prompt_list = str(trajectory_prompts)
-    # completion_list = str(trajectory_completions)
-
-    return {
-        "prompt": query,
-        "answer": final_completion if final_completion else query,
-        "context": f"The agent should create a complete travel itinerary for: {query}",
-        "trajectory_prompts": json_trajectory_prompts,
-        "trajectory_completions": json_trajectory_completions,
-    }
+    # - "context" maps to "reference"
+    # - "trajectory_prompts" and "trajectory_completions" as JSON strings with llm.prompts/completions.* keys
+
+    return {
+        "prompt": query,
+        "answer": final_completion if final_completion else query,
+        "context": f"The agent should create a complete travel itinerary for: {query}",
+        "trajectory_prompts": json_trajectory_prompts,
+        "trajectory_completions": json_trajectory_completions,
+        "tool_calls": trajectory_data["tool_calls"],
+    }
```

This keeps evaluator inputs JSON-encoded (with a clean `[]` for the empty case), fixes the counts, and surfaces `tool_calls` for downstream inspection.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
- packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1): EvaluatorDetails (5-23)
🪛 Flake8 (7.3.0)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
[error] 137-137: local variable '_trace_wrapper_instance' is assigned to but never used
(F841)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
213-219: Consider moving this statement to an else block
(TRY300)
221-222: Remove exception handler; error is immediately re-raised
(TRY203)
222-222: Use raise without specifying exception name
Remove exception name
(TRY201)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
571-571: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
241-255: Docstring update for `word_count_ratio` looks correct

Including `denominator_text` in the "Required task output fields" aligns the docstring with the actual `required_input_fields`. No changes needed here.

packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
292-300: Fix sample results printing to match actual task return structure

The code at lines 298-300 accesses keys that don't exist in `task_result`. The `travel_agent_task` function returns `prompt`, `answer`, `context`, `trajectory_prompts`, and `trajectory_completions`, but not `tool_calls`, `prompts`, or `completions`. Update to:

```python
print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
prompts_json = task_result.get("trajectory_prompts", "[]")
completions_json = task_result.get("trajectory_completions", "[]")
try:
    prompts_obj = json.loads(prompts_json)
    completions_obj = json.loads(completions_json)
except json.JSONDecodeError:
    prompts_obj = {}
    completions_obj = {}
print(f"   - Prompt attributes captured: {len(prompts_obj)}")
print(f"   - Completion attributes captured: {len(completions_obj)}")
```

Note: If `tool_calls` should be displayed here, it must first be added to the return dict of `travel_agent_task` (currently it's extracted in `extract_trajectory_from_spans` but not included in the final return).
Important
Looks good to me! 👍
Reviewed 5dab9df in 1 minute and 38 seconds. Click for details.
- Reviewed 71 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:27
- Draft comment: Removed redundant comment before the module loading function. Consider adding a more descriptive module-level docstring if needed.
- Reason this comment was not posted:
Comment looked like it was already resolved.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:135
- Draft comment: Removed the unused binding to _trace_wrapper_instance when clearing the singleton. This cleanup eliminates clutter without affecting functionality.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). This comment is purely informative, explaining a change that was made without suggesting any action or asking for clarification. It doesn't provide any actionable feedback or raise any concerns about the code.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:196
- Draft comment: Refactored assignment of 'tools_called' improves readability of the print statement.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the refactoring improves readability, which is not necessary for the PR author to know.
4. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:208
- Draft comment: Updated comment to reflect that 'trajectory_prompts' and 'trajectory_completions' are returned as JSON strings, aligning with the fallback behavior when they are empty. Verify that downstream evaluators expect a JSON string rather than a dict.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). The comment is asking the PR author to verify that downstream evaluators expect a JSON string rather than a dict. This falls under the category of asking the author to confirm or verify something, which is against the rules. Therefore, this comment should be removed.
5. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:270
- Draft comment: Experiment slug updated from 'travel-agent-exp-2' to 'travel-agent-exp-3'. Ensure that related documentation and platform configurations are updated accordingly.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). The comment is asking the PR author to ensure that related documentation and platform configurations are updated, which falls under the rule of not asking the author to ensure things are tested or verified. This is not a specific code suggestion or a request for a test to be written.
Workflow ID: wflow_nw1dA9YKK30uCA8Z
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: traceloop/openllmetry PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T15:06:48.109Z
Learning: For debugging OpenTelemetry spans, use ConsoleSpanExporter with Traceloop to print spans to console
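Applied to this repo, that learning looks roughly like the following sketch; the `app_name` value is made up, and the exact `Traceloop.init` keyword arguments should be checked against the SDK version in use.

```python
from opentelemetry.sdk.trace.export import ConsoleSpanExporter
from traceloop.sdk import Traceloop

# Print every span to stdout instead of shipping it to a backend, which
# makes it easy to eyeball the gen_ai.* attributes being emitted.
Traceloop.init(app_name="travel-agent-debug", exporter=ConsoleSpanExporter())
```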
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
212-218: Consider moving this statement to an else block
(TRY300)
220-221: Remove exception handler; error is immediately re-raised
(TRY203)
221-221: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
```python
return {
    "prompt": query,
    "answer": final_completion if final_completion else query,
```
Fallback to query as answer is semantically incorrect.
If final_completion is empty, the code returns the input query as the answer. This conflates the question with the response. Consider returning an empty string or a placeholder like "No completion captured" to preserve data integrity.
Apply this diff:

```diff
-    "answer": final_completion if final_completion else query,
+    "answer": final_completion if final_completion else "No completion captured",
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "answer": final_completion if final_completion else query, | |
| "answer": final_completion if final_completion else "No completion captured", |
🤖 Prompt for AI Agents
In
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
around line 214, the current fallback returns the input query as the answer when
final_completion is empty; change this to return a safe fallback (e.g., an empty
string "" or the placeholder "No completion captured") instead of the query to
avoid conflating question and response, so update the "answer" value to
final_completion if present else "" (or your chosen placeholder) and keep the
rest of the structure unchanged.
Caution
Changes requested ❌
Reviewed 8ea4145 in 2 minutes and 17 seconds. Click for details.
- Reviewed 79 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:26
- Draft comment: Using sys.path modification instead of dynamic module loading simplifies import but consider using a proper package structure to avoid side effects.
- Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:112
- Draft comment: Removal of the dynamic module assignment in travel_agent_task in favor of the direct import improves clarity.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
Workflow ID: wflow_qUQO06CtHHsx3h0A
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
```python
for key, value in attributes.items():
    if key.startswith("gen_ai.prompt."):
        trajectory_prompts_dict[key] = value
```
Removal of key conversion (gen_ai.prompt -> llm.prompts) may conflict with evaluator expectations. Update the docstring or conversion logic if this change is intentional.
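If the conversion is meant to stay, a remap along these lines would restore the `llm.*` prefix; this is a sketch assuming a plain prefix substitution is all the evaluators need:

```python
def remap_gen_ai_keys(attributes: dict) -> dict:
    """Remap gen_ai.prompt.* / gen_ai.completion.* keys to the llm.* prefix.

    Keys without a matching gen_ai prompt/completion prefix are dropped,
    which fits the trajectory dicts built in the snippet above.
    """
    prefix_map = {
        "gen_ai.prompt.": "llm.prompts.",
        "gen_ai.completion.": "llm.completions.",
    }
    remapped = {}
    for key, value in attributes.items():
        for old_prefix, new_prefix in prefix_map.items():
            if key.startswith(old_prefix):
                remapped[new_prefix + key[len(old_prefix):]] = value
                break
    return remapped
```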
Actionable comments posted: 0
♻️ Duplicate comments (4)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (4)
105-111: Ensure returned task fields match documented "Required fields" and planned evaluators

Your docstring lists `tool_calls` as a required field, and you're already computing tool calls (plus tool inputs/outputs) in `extract_trajectory_from_spans`, but the task result currently omits them. The diff above adds `tool_calls`; if you intend to plug in `agent_tool_error_detector` later, consider also returning `tool_inputs` and `tool_outputs` from `trajectory_data` so the evaluator has everything it needs.

Also applies to: 182-197
269-278: Result summary uses wrong keys and doesn't decode JSON trajectories

In the sample results, you read `prompts`/`completions`, but `travel_agent_task` returns `trajectory_prompts`/`trajectory_completions` as JSON strings. That will always fall back to `''` and counts won't reflect the captured data. You can decode the JSON and count elements instead:

```diff
-        task_result = result.task_result
-        print(f"   - Query: {task_result.get('prompt', 'N/A')[:100]}...")
-        print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
-        print(f"   - Prompts captured: {len(task_result.get('prompts', '').split('---'))}")
-        print(f"   - Completions captured: {len(task_result.get('completions', '').split('---'))}")
+        task_result = result.task_result
+        print(f"   - Query: {task_result.get('prompt', 'N/A')[:100]}...")
+        print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
+        prompts_str = task_result.get("trajectory_prompts", "[]")
+        completions_str = task_result.get("trajectory_completions", "[]")
+        try:
+            prompts_data = json.loads(prompts_str)
+        except (TypeError, json.JSONDecodeError):
+            prompts_data = []
+        try:
+            completions_data = json.loads(completions_str)
+        except (TypeError, json.JSONDecodeError):
+            completions_data = []
+        print(f"   - Prompts captured: {len(prompts_data)}")
+        print(f"   - Completions captured: {len(completions_data)}")
```

This keeps the summary aligned with the actual task output format.
1-10: Evaluator list is inconsistent (docs/prints say 5, code runs 4)

Module and function docstrings plus console output describe five evaluators including "Agent Tool Error Detector", but `evaluators` only contains four (`agent_goal_accuracy`, `agent_flow_quality`, `agent_efficiency`, `agent_goal_completeness`). Either add `EvaluatorMadeByTraceloop.agent_tool_error_detector(...)` (and wire the required fields into `travel_agent_task`) or update all three locations to describe only the four actually used, to avoid confusing users of the example. A sketch of a five-evaluator list follows below.

Also applies to: 203-224
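If the intent is to run all five, the wiring would look roughly like this; the constructor arguments shown (especially for `agent_tool_error_detector`) are assumptions, so check each evaluator's `required_input_fields` before copying:

```python
evaluators = [
    EvaluatorMadeByTraceloop.agent_goal_accuracy(threshold=0.7),
    EvaluatorMadeByTraceloop.agent_flow_quality(conditions=["books flights before hotels"]),
    EvaluatorMadeByTraceloop.agent_efficiency(threshold=0.7),
    EvaluatorMadeByTraceloop.agent_goal_completeness(threshold=0.7),
    # Assumed call shape; the real parameters (and required task fields such
    # as tool_inputs/tool_outputs) must match the evaluator's definition.
    EvaluatorMadeByTraceloop.agent_tool_error_detector(),
]
```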
129-197: Clean up trajectory encoding, answer fallback, and outer `try`/`except` in `travel_agent_task`

This block has a few issues: double-JSON-encoding empty trajectories, using the query text as a fallback "answer", and a `try`/`except` that just re-raises the same exception (flagged by Ruff). The example below normalizes empties without double-encoding, keeps attribute counts correct, returns a safer fallback answer, includes `tool_calls` in the task result, and removes the useless `try`/`except`:

```diff
-    try:
-        # Run the travel agent query
-        print(f"\n{'='*80}")
-        print(f"Running travel agent for query: {query}")
-        print(f"{'='*80}\n")
-
-        tool_calls_made = await run_travel_query(query)
-
-        # Get all captured spans
-        spans = exporter.get_finished_spans()
-
-        print(f"\n{'='*80}")
-        print(f"Captured {len(spans)} spans from travel agent execution")
-        print(f"{'='*80}\n")
-
-        # Extract trajectory from spans
-        trajectory_data = extract_trajectory_from_spans(spans)
-
-        # Get the final completion from llm.completions dict
-        completions_dict = trajectory_data["trajectory_completions"]
-        final_completion = ""
-        if completions_dict:
-            # Find the highest index completion content
-            max_idx = -1
-            for key in completions_dict.keys():
-                if ".content" in key:
-                    try:
-                        parts = key.split(".")
-                        idx = int(parts[2])
-                        if idx > max_idx:
-                            max_idx = idx
-                            final_completion = completions_dict[key]
-                    except (ValueError, IndexError):
-                        pass
-
-        # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-        # If empty, use JSON string fallback to avoid validation errors
-        trajectory_prompts = trajectory_data["trajectory_prompts"]
-        trajectory_completions = trajectory_data["trajectory_completions"]
-
-        # Convert to JSON strings if empty (evaluators expect string when no data)
-        if not trajectory_prompts:
-            trajectory_prompts = json.dumps([])
-        if not trajectory_completions:
-            trajectory_completions = json.dumps([])
-
-        print("📊 Trajectory Summary:")
-        print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-        print(f"   - Completion attributes captured: {len(trajectory_completions)}")
-        tools_called = ', '.join(trajectory_data['tool_calls']) if trajectory_data['tool_calls'] else 'None'
-        print(f"   - Tools called: {tools_called}")
-        print(f"   - Tools from run: {', '.join(tool_calls_made) if tool_calls_made else 'None'}\n")
-
-        # Return data using field synonyms that map to required evaluator fields
-        # - "prompt" maps to "question"
-        # - "answer" maps to "completion"
-        # - "context" maps to "reference"
-        # - "trajectory_prompts" and "trajectory_completions" as JSON strings
-
-        json_trajectory_prompts = json.dumps(trajectory_prompts)
-        json_trajectory_completions = json.dumps(trajectory_completions)
-
-        return {
-            "prompt": query,
-            "answer": final_completion if final_completion else query,
-            "context": f"The agent should create a complete travel itinerary for: {query}",
-            "trajectory_prompts": json_trajectory_prompts,
-            "trajectory_completions": json_trajectory_completions,
-        }
-
-    except Exception as e:
-        raise e
+    # Run the travel agent query
+    print(f"\n{'='*80}")
+    print(f"Running travel agent for query: {query}")
+    print(f"{'='*80}\n")
+
+    tool_calls_made = await run_travel_query(query)
+
+    # Get all captured spans
+    spans = exporter.get_finished_spans()
+
+    print(f"\n{'='*80}")
+    print(f"Captured {len(spans)} spans from travel agent execution")
+    print(f"{'='*80}\n")
+
+    # Extract trajectory from spans
+    trajectory_data = extract_trajectory_from_spans(spans)
+
+    # Get the final completion from llm.completions dict
+    completions_dict = trajectory_data["trajectory_completions"]
+    final_completion = ""
+    if completions_dict:
+        # Find the highest index completion content
+        max_idx = -1
+        for key in completions_dict.keys():
+            if ".content" in key:
+                try:
+                    parts = key.split(".")
+                    idx = int(parts[2])
+                    if idx > max_idx:
+                        max_idx = idx
+                        final_completion = completions_dict[key]
+                except (ValueError, IndexError):
+                    pass
+
+    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    prompt_attr_count = len(trajectory_prompts)
+    completion_attr_count = len(trajectory_completions)
+
+    # Normalise empty trajectories so we don't double-encode JSON
+    if not trajectory_prompts:
+        trajectory_prompts = []
+    if not trajectory_completions:
+        trajectory_completions = []
+
+    print("📊 Trajectory Summary:")
+    print(f"   - Prompt attributes captured: {prompt_attr_count}")
+    print(f"   - Completion attributes captured: {completion_attr_count}")
+    tools_called = ', '.join(trajectory_data['tool_calls']) if trajectory_data['tool_calls'] else 'None'
+    print(f"   - Tools called: {tools_called}")
+    print(f"   - Tools from run: {', '.join(tool_calls_made) if tool_calls_made else 'None'}\n")
+
+    # Return data using field synonyms that map to required evaluator fields
+    # - "prompt" maps to "question"
+    # - "answer" maps to "completion"
+    # - "context" maps to "reference"
+    # - "trajectory_prompts" and "trajectory_completions" as JSON strings
+
+    json_trajectory_prompts = json.dumps(trajectory_prompts)
+    json_trajectory_completions = json.dumps(trajectory_completions)
+
+    return {
+        "prompt": query,
+        "answer": final_completion if final_completion else "No completion captured",
+        "context": f"The agent should create a complete travel itinerary for: {query}",
+        "trajectory_prompts": json_trajectory_prompts,
+        "trajectory_completions": json_trajectory_completions,
+        "tool_calls": trajectory_data.get("tool_calls", []),
+    }
```

This should also clear the Ruff TRY201/203/300 and the earlier comments about double-encoding and the semantically odd `answer` fallback.

Also applies to: 199-200
🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
26-31: Unused `# noqa: E402` directive on `run_travel_query` import

Ruff reports this `noqa` as unused; if E402 is not enabled in your lint config, you can drop the directive to reduce noise. If you do rely on Flake8's E402, consider a dynamic import (e.g., via `importlib`) instead, so the import can live at the top of the file without needing `# noqa`.
37-51: `extract_trajectory_from_spans` docstring doesn't match behavior

The docstring says you convert `gen_ai.prompt.*`/`gen_ai.completion.*` to `llm.prompts.*`/`llm.completions.*`, but the implementation simply filters `gen_ai.*` keys without renaming. Please either (a) actually remap the keys to the `llm.*` prefix, or (b) update the docstring/inline comment so expectations for downstream evaluators are accurate.

Also applies to: 61-66
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🧬 Code graph analysis (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (5)
- packages/traceloop-sdk/traceloop/sdk/__init__.py (2): Traceloop (36-267), init (48-198)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3): agent_flow_quality (569-596), agent_efficiency (599-618), agent_goal_completeness (621-640)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (2): InMemorySpanExporter (22-61), get_finished_spans (40-43)
- packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1): TracerWrapper (63-227)
- packages/sample-app/sample_app/agents/travel_agent_example.py (1): run_travel_query (700-719)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
31-31: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
191-197: Consider moving this statement to an else block
(TRY300)
199-200: Remove exception handler; error is immediately re-raised
(TRY203)
200-200: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
Important
Looks good to me! 👍
Reviewed 669c6f7 in 49 seconds. Click for details.
- Reviewed 22 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:64
- Draft comment: Removed extra trailing whitespace on the assignment to trajectory_completions_dict. This improves code consistency.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:248
- Draft comment: Updated experiment_slug from 'travel-agent-exp-3' to 'travel-agent-exp' for consistency and to fix potential naming issues.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
Workflow ID: wflow_ruck5OesSDuLF7rf
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (7)
1-14: Docstring lists 5 evaluators but only 4 are instantiated.

The module docstring references "Agent Tool Error Detector" at line 7, but the evaluators list at lines 228-236 only instantiates four evaluators (missing `agent_tool_error_detector`). This has been flagged in previous reviews.
193-193: Fallback to query as answer is semantically incorrect.

If `final_completion` is empty, returning the input `query` as the `answer` conflates the question with the response. This has been flagged in previous reviews.
199-200: Remove unnecessary try/except block.

The try/except at lines 129-200 catches and immediately re-raises without adding logging or context. This has been flagged in previous reviews.
219-224: Printed list claims 5 evaluators but only 4 are instantiated.

The console output lists "Agent Tool Error Detector" as evaluator #2, but it's not instantiated in the evaluators list at lines 228-236. This has been flagged in previous reviews.
277-278: Field name mismatch in result inspection.

Lines 277-278 access `task_result.get('prompts', '')` and `task_result.get('completions', '')`, but `travel_agent_task` returns keys named `trajectory_prompts` and `trajectory_completions` (lines 195-196). This has been flagged in previous reviews.
166-173: Avoid double JSON encoding for empty trajectories.

When `trajectory_prompts` or `trajectory_completions` are empty dicts, lines 171 and 173 convert them to JSON strings (`json.dumps([])`). Then lines 188-189 call `json.dumps()` again on these strings, resulting in double-encoded JSON strings like `"\"[]\""` instead of `"[]"`.

Apply this diff to initialize with empty lists instead:

```diff
-    # Convert to JSON strings if empty (evaluators expect string when no data)
+    # Initialize empty dicts as empty lists for consistent JSON encoding
     if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
+        trajectory_prompts = []
     if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+        trajectory_completions = []
```
37-47: Docstring incorrectly claims format conversion.

The docstring at line 40 states "Converts gen_ai.prompt.* to llm.prompts.* format expected by evaluators" and line 46 claims the return dict has "llm.prompts.* keys", but the code at line 63 stores keys unchanged (still as `gen_ai.prompt.*`). No conversion occurs.

Apply this diff to correct the docstring:

```diff
 def extract_trajectory_from_spans(spans):
     """
     Extract prompt and completion trajectory from OpenTelemetry spans.
-    Converts gen_ai.prompt.* to llm.prompts.* format expected by evaluators.

     Args:
         spans: List of ReadableSpan objects from InMemorySpanExporter

     Returns:
-        dict with trajectory_prompts (as dict with llm.prompts.* keys), trajectory_completions, and tool_calls
+        dict with trajectory_prompts (as dict with gen_ai.prompt.* keys), trajectory_completions (as dict with gen_ai.completion.* keys), and tool_calls
     """
```
🧹 Nitpick comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
31-31: Remove unused `noqa` directive.

The `noqa: E402` comment is unnecessary as the E402 rule is not enabled in your linting configuration.

Apply this diff:

```diff
-from travel_agent_example import run_travel_query  # noqa: E402
+from travel_agent_example import run_travel_query
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
31-31: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
191-197: Consider moving this statement to an else block
(TRY300)
199-200: Remove exception handler; error is immediately re-raised
(TRY203)
200-200: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
283-289: LGTM!

The main block correctly sets up and executes the experiment with clear informational output.
Caution
Changes requested ❌
Reviewed 9b03e5a in 2 minutes and 52 seconds. Click for details.
- Reviewed 379 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:56
- Draft comment: Using run_with_span_capture cleans up the span extraction logic. Consider adding a note in the docstring that it returns JSON-encoded trajectories to clarify the expected format downstream.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). The comment is suggesting a documentation improvement rather than pointing out a bug or required code change. According to the rules, comments should only be made if there is "clearly a code change required" and should not be "obvious or unimportant." This comment is about adding clarifying documentation, which is somewhat subjective. The comment also makes an assumption about what `run_with_span_capture` returns (JSON-encoded trajectories) without me being able to verify this from the diff. The old code had explicit JSON encoding, but the new code just assigns the return values directly to dictionary fields. Without seeing the implementation of `run_with_span_capture`, I cannot confirm if the comment's premise is correct. The comment is also somewhat speculative about what would be helpful "downstream." The comment might be helpful if the return format is indeed unclear and causes confusion. Documentation improvements can prevent future bugs. However, I cannot verify from the diff alone whether `run_with_span_capture` returns JSON strings or dictionaries, so the comment might be based on incorrect assumptions. While documentation can be helpful, this comment doesn't point to a clear code issue. It's a suggestion for a minor documentation enhancement that may or may not be necessary depending on how clear the `run_with_span_capture` function's own documentation is. The rules state to only keep comments where there's "STRONG EVIDENCE that the comment is correct" and this requires cross-file context to verify. This comment suggests a documentation improvement but doesn't identify a clear code defect. It requires understanding the implementation of `run_with_span_capture` (which is in another file) to verify if it's even correct. Since it's not pointing to a required code change and understanding it requires cross-file context, it should be deleted per the rules.
2. packages/traceloop-sdk/traceloop/sdk/experiment/utils.py:106
- Draft comment: Consider using a logging framework instead of print statements for better production logging and control.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:59
- Draft comment: Typo: In the comment, consider changing "This is the agents input" to "This is the agent's input" to correctly use the possessive form.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). This is a pure grammar/typo fix in a code comment. According to the rules, I should "Do NOT comment unless there is clearly a code change required" and "Do NOT make comments that are obvious or unimportant." A typo in a comment doesn't require a code change - the code functions identically. This is an extremely minor issue that doesn't affect functionality, readability in any meaningful way, or code quality. While the grammar correction is technically accurate, it falls into the category of "obvious or unimportant" comments. The comment is perfectly understandable as-is, and fixing this typo provides minimal value. Could this be considered important for code quality and professionalism? Some teams have strict standards about grammar in comments, and maintaining high-quality documentation could be valuable for long-term maintainability. While documentation quality can be important, this is an inline comment explaining a function parameter, not user-facing documentation. The meaning is completely clear even with the minor grammatical error. The rules explicitly state to avoid "obvious or unimportant" comments, and a possessive apostrophe in an inline comment clearly falls into this category. This comment should be deleted. It's a minor grammatical correction in an inline code comment that doesn't affect functionality, clarity, or code quality in any meaningful way. It violates the rule about not making obvious or unimportant comments.
Workflow ID: wflow_2cqqOe16JNMxUDXE
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
♻️ Duplicate comments (3)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
141-160: Fix empty-trajectory handling to avoid double JSON encoding and incorrect counts.

Current flow for empty trajectories:
- `trajectory_prompts`/`trajectory_completions` start as `{}`.
- For empties, they're set to `json.dumps([])` (a string `"[]"`).
- Later both are wrapped again with `json.dumps(...)`, yielding `"\"[]\""`. `len(trajectory_prompts)`/`len(trajectory_completions)` then measure string length (2), not attribute counts.

This breaks consumers expecting a single-layer JSON string and makes the summary counts wrong.

Suggested change: keep the internal values as structured objects (dict/list), only JSON-encode once at the end:

```diff
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    # Use an empty list as the canonical "no trajectory" structure
+    if not trajectory_prompts:
+        trajectory_prompts = []
+    if not trajectory_completions:
+        trajectory_completions = []
@@
-    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
+    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
+    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
@@
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
+    json_trajectory_prompts = json.dumps(trajectory_prompts)
+    json_trajectory_completions = json.dumps(trajectory_completions)
```

This preserves summary semantics and returns a single JSON layer while keeping the earlier reviewer feedback about empty lists in mind.
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
1-14: Docstrings and console text advertise 5 evaluators, but only 4 are instantiated.
- Module and function docstrings, plus the printed description, list “Agent Tool Error Detector” and say there are five evaluators.
- The actual `evaluators` list only includes: `agent_goal_accuracy`, `agent_flow_quality`, `agent_efficiency`, `agent_goal_completeness`.

Please either:
- Add `EvaluatorMadeByTraceloop.agent_tool_error_detector(...)` to the list (and ensure the task output exposes whatever `tool_input`/`tool_output` fields that evaluator requires), or
- Update all docstrings and printed descriptions to only mention the four evaluators you actually run.
Right now the example is misleading for users following the documented behavior.
Also applies to: 71-93, 95-104
35-68: Avoid using the input query as the fallback "answer".

In `travel_agent_task`, the returned payload uses:

```python
"answer": final_completion if final_completion else query,
```

When no completion is captured, this reports the original question as the answer, which corrupts downstream evaluation (e.g., goal accuracy, flow quality) and makes debugging harder.

Prefer a neutral or explicit fallback, e.g.:

```diff
-    "answer": final_completion if final_completion else query,
+    "answer": final_completion if final_completion else "No completion captured",
```

(or an empty string / `None`, depending on how the evaluators handle missing completions).
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
163-164: Remove no-op `try`/`except` wrapper around the function body.

The `except Exception as e: raise e` adds no value and triggers Ruff TRY201/TRY203/TRY300 hints. You can safely drop the outer `try`/`except` and let exceptions propagate with their original traceback:

```diff
-    try:
-        # Run the task callable
+    # Run the task callable
@@
-        json_trajectory_completions = json.dumps(trajectory_completions)
-
-        return json_trajectory_prompts, json_trajectory_completions, final_completion
-
-    except Exception as e:
-        raise e
+    json_trajectory_completions = json.dumps(trajectory_completions)
+
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
🧬 Code graph analysis (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (3)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4): agent_goal_accuracy (413-433), agent_flow_quality (569-596), agent_efficiency (599-618), agent_goal_completeness (621-640)
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1): run_with_span_capture (68-164)
- packages/sample-app/sample_app/agents/travel_agent_example.py (1): run_travel_query (700-719)

packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (3)
- packages/traceloop-sdk/traceloop/sdk/__init__.py (2): Traceloop (36-267), init (48-198)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (2): InMemorySpanExporter (22-61), get_finished_spans (40-43)
- packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1): TracerWrapper (63-227)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
161-161: Consider moving this statement to an else block
(TRY300)
163-164: Remove exception handler; error is immediately re-raised
(TRY203)
164-164: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
Important
Looks good to me! 👍
Reviewed 140bb3a in 53 seconds. Click for details.
- Reviewed `15` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `2` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:57
- Draft comment: Minor whitespace reformatting for clarity. No logic or performance issues.
- Reason this comment was not posted: Confidence changes required: `0%` <= threshold `50%`
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:59
- Draft comment: Typo: The comment 'This is the agents input' should likely be "This is the agent's input" (with an apostrophe) for grammatical correctness.
- Reason this comment was not posted: Comment was on unchanged code.
Workflow ID: wflow_gkEiIegWgGZNwqFw
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (4)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (4)
1-14: Docstring lists 5 evaluators but only 4 are implemented.

The module docstring references "Agent Tool Error Detector" on line 7, but the evaluators list at lines 96-104 only instantiates four evaluators (agent_goal_accuracy, agent_flow_quality, agent_efficiency, agent_goal_completeness). Update the docstring to remove line 7 or add the missing evaluator to lines 96-104.
62-68: Fallback to query as answer is semantically incorrect.

On line 64, when `final_completion` is empty, the code returns the input `query` as the `answer`. This conflates the question with the response. Use an empty string or a placeholder like `"No completion captured"` instead.

Apply this diff:

```diff
     return {
         "prompt": query,
-        "answer": final_completion if final_completion else query,
+        "answer": final_completion if final_completion else "No completion captured",
         "context": f"The agent should create a complete travel itinerary for: {query}",
         "trajectory_prompts": trajectory_prompts,
         "trajectory_completions": trajectory_completions,
     }
```
73-81: Function docstring lists 5 evaluators but only 4 are implemented.

The docstring references "Agent Tool Error Detector" in item 2, but the evaluators list at lines 96-104 only instantiates four evaluators. Update the docstring to match the actual implementation.
88-92: Print statements list 5 evaluators but only 4 are implemented.

Lines 88-92 print a list that includes "Agent Tool Error Detector" on line 89, but the evaluators list at lines 96-104 only instantiates four evaluators. Update the print statements to match the actual implementation or add the missing evaluator.
🧹 Nitpick comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
29-29: Remove unused `noqa` directive.

Ruff reports that the `noqa: E402` directive is unused. Remove it to clean up the code.

Apply this diff:

```diff
-from travel_agent_example import run_travel_query  # noqa: E402
+from travel_agent_example import run_travel_query
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- `packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py`
🧬 Code graph analysis (1)
`packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py` (3)

- `packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py` (5)
  - `EvaluatorMadeByTraceloop` (5-729)
  - `agent_goal_accuracy` (413-433)
  - `agent_flow_quality` (569-596)
  - `agent_efficiency` (599-618)
  - `agent_goal_completeness` (621-640)
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (1)
  - `run_with_span_capture` (68-164)
- `packages/sample-app/sample_app/agents/travel_agent_example.py` (1)
  - `run_travel_query` (700-719)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
Caution
Changes requested ❌
Reviewed 1a8b20c in 1 minute and 38 seconds. Click for details.
- Reviewed `13` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `0` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
Workflow ID: wflow_0Zc7lUyWJMBji9uk
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (3)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (3)
104-105: Remove the no-op try/except that just re-raises the same exception

The `try` block covers almost the entire function, and the `except Exception as e: raise e` handler adds no value (and is what Ruff flags with TRY203/TRY201). Letting exceptions propagate naturally keeps tracebacks intact and simplifies control flow.

You can just drop the `try`/`except` entirely:

```diff
-    try:
-        # Run the task callable
+    # Run the task callable
@@
-        return json_trajectory_prompts, json_trajectory_completions, final_completion
-
-    except Exception as e:
-        raise e
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```

Also applies to: 163-164
11-21: Align `extract_trajectory_from_spans` docstring with actual behavior and returned fields

The docstring still says gen_ai.* attributes are converted to `llm.prompts.*`/`llm.completions.*` and that only prompts/completions/tool_calls are returned, but the implementation:

- Keeps keys as `gen_ai.prompt.*`/`gen_ai.completion.*` without renaming.
- Also returns `tool_inputs` and `tool_outputs`.

Either update the implementation to actually produce `llm.prompts.*`/`llm.completions.*` and restrict the return shape, or (probably simpler here) update the docstring to describe that gen_ai.* keys are passed through unchanged and that `tool_calls`, `tool_inputs`, and `tool_outputs` are part of the contract.

Also applies to: 59-65
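A docstring along these lines would match the behavior described above (a sketch only; the field names are taken from this review, not re-verified against the module):

```python
def extract_trajectory_from_spans(spans):
    """Extract trajectory data from finished OTEL spans.

    gen_ai.prompt.* and gen_ai.completion.* attribute keys are passed
    through unchanged (no renaming to llm.prompts.*/llm.completions.*).

    Returns a dict with: trajectory_prompts, trajectory_completions,
    tool_calls, tool_inputs, and tool_outputs.
    """
```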
139-160: Fix double JSON-encoding of empty trajectories and incorrect attribute counts

When there are no prompts/completions, `trajectory_prompts`/`trajectory_completions` are set to `json.dumps([])` and then wrapped again with `json.dumps(...)`, producing `"\"[]\""` instead of `"[]"`. It also makes the "Prompt attributes captured" count show `len("[]") == 2` instead of 0.

You can avoid both issues and still return empty arrays as JSON like this:

```diff
-    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-    # If empty, use JSON string fallback to avoid validation errors
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+    # trajectory_prompts and trajectory_completions are dicts with gen_ai.* keys
+    # Keep them as Python objects for counting; build JSON strings below
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
@@
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
-
-    return json_trajectory_prompts, json_trajectory_completions, final_completion
+    if trajectory_prompts:
+        json_trajectory_prompts = json.dumps(trajectory_prompts)
+    else:
+        json_trajectory_prompts = json.dumps([])
+
+    if trajectory_completions:
+        json_trajectory_completions = json.dumps(trajectory_completions)
+    else:
+        json_trajectory_completions = json.dumps([])
+
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```

This preserves correct counts and produces `"[]"` in the "no data" case instead of a double-encoded string.
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
90-103: Be cautious resetting the `TracerWrapper` singleton and reinitializing Traceloop in-process

Deleting `TracerWrapper.instance` and calling `Traceloop.init(...)` unconditionally is fine for a dedicated experiment runner, but in a long-lived process (e.g., an app using Traceloop elsewhere) this will tear down and replace the existing tracer configuration, potentially surprising callers that already initialized tracing.

If this utility is intended strictly for standalone experiments, consider documenting that explicitly; otherwise you may want a code path that:
- Detects an existing initialized tracer and either reuses it, or
- Makes re‑initialization opt‑in via a parameter (see the sketch below).
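A minimal sketch of the opt-in gate (the `reinit_tracer` parameter is hypothetical, and a reused tracer would of course keep its original exporter):

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.tracing.tracing import TracerWrapper
from traceloop.sdk.utils.in_memory_span_exporter import InMemorySpanExporter


def run_with_span_capture(task, *args, reinit_tracer=False, **kwargs):
    exporter = InMemorySpanExporter()
    if reinit_tracer and hasattr(TracerWrapper, "instance"):
        # Tear down the existing tracer only on explicit opt-in
        del TracerWrapper.instance
    if not hasattr(TracerWrapper, "instance"):
        Traceloop.init(exporter=exporter, disable_batch=True)
    result = task(*args, **kwargs)
    return result, exporter.get_finished_spans()
```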
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py`
🧬 Code graph analysis (1)
`packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (3)

- `packages/traceloop-sdk/traceloop/sdk/__init__.py` (2)
  - `Traceloop` (36-267)
  - `init` (48-198)
- `packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py` (1)
  - `get_finished_spans` (40-43)
- `packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py` (1)
  - `TracerWrapper` (63-227)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
161-161: Consider moving this statement to an else block
(TRY300)
163-164: Remove exception handler; error is immediately re-raised
(TRY203)
164-164: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
Important
Looks good to me! 👍
Reviewed e72a010 in 39 seconds. Click for details.
- Reviewed `12` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `1` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:124
- Draft comment: Add a comment explaining why 'traceloop/sdk/experiment/utils.py' is excluded from type checking.
- Reason this comment was not posted: Confidence changes required: `33%` <= threshold `50%`
Workflow ID: wflow_7r0bmYjnwTIvhXb0
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/traceloop-sdk/pyproject.toml (1)
124-124: Reconsider broad module exclusion from type checking in a strict mypy configuration.

The mypy configuration enforces `disallow_untyped_defs = true` (line 105), yet you're excluding an entire new module from type checking. This creates an inconsistency and introduces a potential maintenance burden. The enriched summary mentions "dynamic typing patterns," but excluding a full module is a code smell that suggests untyped code could hide runtime type errors.

Consider one of these alternatives:
- Properly type the module to align with the strict configuration (preferred)
- Use narrower exclusion for specific functions/classes that genuinely require dynamic typing (e.g., using `# type: ignore` comments on specific lines)
- Document why broad exclusion is necessary if it must remain (e.g., in a comment or ADR)
- Add type stubs for the dynamic parts if the code genuinely can't be typed
Can you clarify why the entire `utils.py` module requires exclusion from type checking? If there are specific functions with dynamic patterns that resist typing, we could use more targeted suppression instead of excluding the whole module.
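For illustration, a line-scoped suppression inside `utils.py` might look like this (the helper and the exact error code are hypothetical, invented for the sketch rather than taken from the module):

```python
from opentelemetry.sdk.trace import ReadableSpan


def _span_attributes(span: ReadableSpan) -> dict:
    # Hypothetical helper: span.attributes is Optional and heterogeneous,
    # which is the kind of dynamic spot mypy flags; suppress just this
    # line instead of excluding the whole module from type checking.
    return dict(span.attributes)  # type: ignore[arg-type]
```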
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/traceloop-sdk/pyproject.toml` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
Fixes TLP-1261
`feat(instrumentation): ...` or `fix(instrumentation): ...`
Introduces a travel agent experiment with trajectory capture and evaluation using Traceloop's evaluators, and enhances trajectory data extraction utilities.
travel_agent_exp.pyto demonstrate a travel agent experiment using Traceloop's evaluators.extract_trajectory_from_spans()andrun_with_span_capture()inutils.pyfor capturing and extracting trajectory data.agent_flow_quality()inevaluators_made_by_traceloop.pyto acceptthresholdandconditionsfor finer-grained evaluation.This description was created by
for e72a010. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
New Features
Enhancements
✏️ Tip: You can customize this high-level summary in your review settings.